Variable Selection from Random Forests: Application to Gene Expression Data

Author

  • Ramón Díaz-Uriarte
Abstract

Random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its use for gene selection. We first show the effects of changes in parameters of random forest on the prediction error rate with microarray data. Then we present two approaches for gene selection with random forest: 1) comparing variable importance plots from original and permuted data sets; 2) backwards variable elimination, which uses measures of variable importance and error rate and is targeted towards the selection of small sets of genes. Using simulated and real microarray data, we show: 1) variable importance plots can be used to recover the full set of genes related to the outcome of interest, without being adversely affected by collinearities; 2) backwards variable elimination yields small sets of genes while preserving predictive accuracy (compared to several state-of-the-art algorithms). Thus, both methods are useful for gene selection. All code is available as an R package, varSelRF, from CRAN http://cran.r-project.org/src/contrib/PACKAGES.html or from the supplementary material page. Supplementary information: http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
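
The backwards variable elimination idea described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the varSelRF implementation: the `fit` callback, the `drop_fraction` and `tolerance` parameters, and the `toy_fit` stand-in are all hypothetical names and values chosen for the example. In practice `fit` would wrap a random forest and return its out-of-bag error and variable importances; here a deterministic toy function plays that role so the loop can be run without any external library.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback: given variable names, fit a model (e.g. a random
# forest) and return (error_rate, {variable: importance}).
FitFn = Callable[[Sequence[str]], Tuple[float, Dict[str, float]]]


def backward_elimination(
    variables: Sequence[str],
    fit: FitFn,
    drop_fraction: float = 0.2,  # assumed value, not taken from the abstract
    tolerance: float = 0.0,      # assumed slack allowed above the best error
    min_vars: int = 2,
) -> Tuple[List[str], List[Tuple[int, float]]]:
    """Iteratively drop the least important variables and pick the smallest
    variable set whose error is within `tolerance` of the best error seen."""
    current = list(variables)
    history: List[Tuple[List[str], float]] = []
    while True:
        error, importance = fit(current)
        history.append((list(current), error))
        if len(current) <= min_vars:
            break
        # Drop a fraction of the variables, at least one per iteration.
        n_drop = max(1, int(len(current) * drop_fraction))
        n_keep = max(min_vars, len(current) - n_drop)
        ranked = sorted(current, key=lambda v: importance[v], reverse=True)
        current = ranked[:n_keep]
    best_error = min(err for _, err in history)
    candidates = [(vs, err) for vs, err in history if err <= best_error + tolerance]
    selected = min(candidates, key=lambda pair: len(pair[0]))[0]
    trace = [(len(vs), err) for vs, err in history]
    return selected, trace


def toy_fit(variables: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    """Deterministic stand-in for a random forest: two 'genes' carry signal,
    and every extra variable adds a small amount of error."""
    informative = {"g1", "g2"}
    present = informative & set(variables)
    error = 0.5 - 0.2 * len(present) + 0.001 * len(variables)
    importance = {v: (1.0 if v in informative else 0.0) for v in variables}
    return error, importance


genes = [f"g{i}" for i in range(1, 11)]
selected, trace = backward_elimination(genes, toy_fit)
print(selected)
print(trace)
```

The returned trace records the number of variables and the error at each step, which makes it easy to inspect how the error changes as the variable set shrinks. Whether importances are recomputed at every step, as in this sketch, or computed once up front is a design choice that the abstract does not specify.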

Similar resources

Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest

Background & objective: Microarray and next generation sequencing (NGS) data are important sources for finding helpful molecular patterns. Also, the large volume of gene expression data increases the challenge of identifying the biomarkers associated with cancer. The random forest (RF) is used to effectively analyze the problems of large-p and smal...

Variable selection with Random Forests for missing data

Variable selection has been suggested for Random Forests to improve their efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed straightforwardly when there is missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a n...

Evaluation of variable selection methods for random forests and omics data sets.

Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the obj...

Prediction of blood cancer using leukemia gene expression data and sparsity-based gene selection methods

Background: DNA microarray is a useful technology that simultaneously assesses the expression of thousands of genes. It can be utilized for the detection of cancer types and cancer biomarkers. This study aimed to predict blood cancer using leukemia gene expression data and a robust ℓ2,p-norm sparsity-based gene selection method. Materials and Methods: In this descriptive study, the microarray ...

Variable selection for heavy-duty vehicle battery failure prognostics using random survival forests

Prognostics and health management is a useful tool for more flexible maintenance planning and increased system reliability. The application in this study is lead-acid battery failure prognosis for heavy-duty trucks which is important to avoid unplanned stops by the road. There are large amounts of data available, logged from trucks in operation. However, data is not closely related to battery h...

Publication date: 2004